fix: ensure node state persisted before shutdown #743
base: fix/node-stopping-bg-payments
Conversation
Didn't find problems in my tests, but the timing is very difficult to reproduce.
nit: Probably still good to use a non-main thread
Look at the runBlocking(dispatcher) option
The execution already switches to bgDispatcher.
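For context, a minimal standalone sketch (not from the PR) of the `runBlocking(dispatcher)` option: the coroutine body runs on the supplied dispatcher, though `runBlocking` still blocks the calling thread until the body completes.

```kotlin
import kotlinx.coroutines.*

fun main() {
    // Dispatchers.IO stands in for the app's bgDispatcher here.
    // The block runs on an IO worker thread, but runBlocking still
    // blocks the calling (e.g., main) thread until it finishes.
    runBlocking(Dispatchers.IO) {
        println("running on: ${Thread.currentThread().name}")
    }
}
```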
todo: check iOS
…ll keep running. This prevents the node from getting stuck in the Stopping state
Code Review
I found one issue that should be addressed:
`syncWallets()` can consume the entire timeout budget, preventing `node.stop()` from executing
File: app/src/main/java/to/bitkit/services/LightningService.kt
The new syncWallets() call before shutdown shares the same 5-second timeout budget (from onDestroy) as the critical node.stop() call. Since syncWallets() is a network operation without its own timeout, it could take the full 5 seconds, leaving no time for node.stop() to execute.
This is problematic because:
- `syncWallets()` is a network call to VSS that can be slow or hang
- If it takes most or all of the 5 seconds, `node.stop()` never runs
- This leaves the node in a running but unmanaged state, worse than the original problem
- In `onTimeout()`, the budget is only 2 seconds, making this even more likely to occur
Suggested fix: Give syncWallets() its own sub-timeout (e.g., 2 seconds out of the 5-second budget) to guarantee node.stop() has time to execute:
```kotlin
runCatching {
    Logger.debug("Performing final sync before shutdown…", context = TAG)
    val synced = withTimeoutOrNull(2_000L) {
        ServiceQueue.LDK.background { node.syncWallets() }
    }
    if (synced != null) {
        Logger.debug("Final sync completed", context = TAG)
    } else {
        Logger.warn("Sync timed out, proceeding with shutdown", context = TAG)
    }
}.onFailure { Logger.warn("Final sync failed, proceeding with shutdown", it, context = TAG) }
```

This ensures `node.stop()` always has at least 3 seconds to execute properly.
Code Review
I've reviewed this PR and found 2 significant issues that should be addressed before merging.
```kotlin
// On cancellation (e.g., timeout), ensure state is recoverable
if (it is CancellationException) {
    Logger.warn("Node stop cancelled, forcing Stopped state for recovery", context = TAG)
    _lightningState.update { LightningState(nodeLifecycleState = NodeLifecycleState.Stopped) }
    return@withLock Result.failure(it)
```
Issue: Catching CancellationException without rethrowing breaks coroutine cancellation contract
This code catches CancellationException via runCatching and wraps it in Result.failure() without rethrowing. This violates Kotlin's coroutine cancellation contract, which requires CancellationException to always be rethrown for proper structured concurrency.
The same file demonstrates the correct pattern in multiple places:
- Line 223: `// Cancellation is expected during pull-to-refresh, rethrow per Kotlin best practices` followed by `if (it is CancellationException) throw it`
- Line 861: Same pattern
Impact: While this happens to work for the primary onDestroy call site (since withContext(bgDispatcher) may detect parent cancellation independently), stop() is also called from:
- `wipeStorage()` (line 543)
- `restartWithElectrumServer()` (line 562)
- `restartWithRgsServer()` (line 588)
- `restartWithPreviousConfig()` (line 609)
- `restartNode()` (line 1076)
If any of these call sites' coroutines are cancelled while stop() is running, the CancellationException will be silently swallowed, and subsequent code will execute when it should not.
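To illustrate the hazard, a minimal standalone sketch (not the PR's code): `runCatching` swallows the `CancellationException` thrown at the suspension point, so code after it still runs even though the coroutine has been cancelled.

```kotlin
import kotlinx.coroutines.*

fun main() = runBlocking {
    val job = launch {
        runCatching {
            delay(10_000) // throws CancellationException when the job is cancelled
        } // swallowed into Result.failure(...) instead of propagating
        println("subsequent code runs even though the coroutine was cancelled")
    }
    delay(100)
    job.cancelAndJoin() // the println above still executes before join returns
}
```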
Suggested fix: Rethrow the CancellationException after performing the state recovery:
Suggested change:

```kotlin
// On cancellation (e.g., timeout), ensure state is recoverable
if (it is CancellationException) {
    Logger.warn("Node stop cancelled, forcing Stopped state for recovery", context = TAG)
    _lightningState.update { LightningState(nodeLifecycleState = NodeLifecycleState.Stopped) }
    throw it // Rethrow to properly propagate cancellation
}
```
```kotlin
}.onFailure {
    // On cancellation (e.g., timeout), ensure state is recoverable
    if (it is CancellationException) {
        Logger.warn("Node stop cancelled, forcing Stopped state for recovery", context = TAG)
```
Issue: Setting state to Stopped when JNI operations may still be running
When a timeout-induced CancellationException occurs, this code forces the state to NodeLifecycleState.Stopped. However, the native node.syncWallets() and node.stop() JNI calls may still be executing on the ServiceQueue.LDK single-threaded dispatcher. JNI calls are not interruptible by coroutine cancellation - they will continue running until completion.
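As a standalone illustration (not the PR's code), a blocking call standing in for a native operation keeps running after its coroutine is cancelled, because cancellation is cooperative and plain blocking code never checks for it:

```kotlin
import kotlinx.coroutines.*

fun main() = runBlocking {
    val job = launch(Dispatchers.IO) {
        Thread.sleep(2_000) // stand-in for a JNI call; ignores cancellation
        println("native work finished after the coroutine was already cancelled")
    }
    delay(100)
    job.cancel() // returns immediately; the IO thread keeps running
    delay(3_000) // wait long enough to observe the late completion
}
```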
Consequences:
1. **State machine corruption**: The lifecycle state says `Stopped`, but the node is still actively performing I/O, persisting state, etc.
2. **Mutex release allows concurrent operations**: When `return@withLock` executes, the `lifecycleMutex` is released. If `start()` is called next (e.g., on app relaunch), it will:
   - Acquire the mutex
   - See state `Stopped`
   - Proceed with startup
   - Find `lightningService.node` is non-null (since line 259 in LightningService.kt hasn't executed yet)
   - Skip `setup()` and dispatch `node.start()` to the LDK queue
   - The still-running `stop()` block eventually completes and sets the service's `node` reference to `null`
   - Result: the repo thinks the node is `Running`, but the service's `node` reference is `null`
3. **Recovery mechanism doesn't help**: The stuck-`Stopping` recovery at lines 282-286 is never triggered in the timeout scenario, because the state was already forced to `Stopped`.
Possible solutions:
- Leave state as `Stopping` instead of forcing to `Stopped`; this lets the stuck-state recovery mechanism handle it on the next `start()` (see the sketch below)
- Add a flag in `LightningService` to track whether native operations are truly complete, independent of coroutine cancellation
- Rethink the timeout approach: consider whether the timeout should apply to the entire operation or just serve as a best-effort mechanism, with the acknowledgment that native cleanup continues in the background
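A minimal sketch of the first option, reusing names from this review; an illustration, not a definitive implementation:

```kotlin
}.onFailure {
    if (it is CancellationException) {
        // Leave the state as Stopping: the stuck-Stopping recovery in start()
        // can then take over once the native cleanup actually finishes.
        Logger.warn("Node stop cancelled; leaving state as Stopping", context = TAG)
        throw it // propagate cancellation per the coroutine contract
    }
}
```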
Fixes #739
This PR prevents channel state divergence during node shutdown by ensuring state is fully persisted before the service is destroyed.
Description
When the app was stopped while a 0-conf channel had uncommitted state updates, the client (LDK) could end up with a different commitment height than the LSP. On reconnect, the LSP detected this mismatch as "possible data loss" and force-closed the channel.
This PR adds two mitigations:
- `syncWallets()` before stopping the node to ensure the latest channel state is persisted to VSS
- `runBlocking` in `onDestroy()` to wait for the node to fully stop before the service is destroyed, with a 5-second timeout to avoid ANR (a sketch of this flow follows below)
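For illustration, a minimal sketch of that flow; `bgDispatcher`, `syncWallets`, and `stopNode` are stand-ins, not the PR's actual code:

```kotlin
import kotlinx.coroutines.*

val bgDispatcher = Dispatchers.IO // stand-in for the app's background dispatcher

suspend fun syncWallets() { /* persist latest channel state to VSS */ }
suspend fun stopNode() { /* node.stop() dispatched to the LDK queue */ }

// Called from onDestroy(): block until shutdown completes, capped at 5s to avoid ANR.
fun shutdownBlocking() = runBlocking(bgDispatcher) {
    withTimeoutOrNull(5_000L) {
        syncWallets() // 1) persist state first
        stopNode()    // 2) then stop the node
    } ?: println("Shutdown timed out after 5s")
}
```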
Preview
CJIT.webm
multiple-transactions-and-poor-signal.webm
QA Notes
1. Test graceful shutdown: verify the following logs appear:
   - `Performing final sync before shutdown…`
   - `Final sync completed`
   - `onDestroy started`
   - `onDestroy completed`
2. Test restart after stop
3. Regression